0.1 tl;dr

Process:

  • 1st) Simulate a population dataset based on the questions from the recruitment form. This represents a rough guess at the total population of people living at 30-60% of area median income (AMI) in Boulder City. The fake dataset is based on initial estimates and/or guesses about the demographic parameters (including what the parameters should be).

  • 2nd) Randomly sample 4000 applicants from the simulated population data.

  • 3rd) Select first ‘wave’ of 200 program selections using weighted samples.

  • 4th) Select second and third waves using propensity score matching against the applicant pool.

  • 5th) Make a list of additional backups to use if further verifications are needed. Define a process for choosing these backup selections that prioritizes the least represented groups.

  • Last) Make the dataset with selections and backups available for download.

0.2 Assumptions

  • There will be enough recruits into the program that we can have multiple waves of selections within the weighting criteria we define.

  • Failures of verification will be ~randomly distributed across groups.

  • For the sake of the simulations and calculations here (which are just for an abstract presentation of the process), assume there will be 4000 applicants, 200 selections, and 200 backups.

  • For the purposes of weighting, assume groups are independent. That is, we have estimates for the proportion of the population by racial category and we use these weights to make a random selection, likewise with gender, and disability, etc.

0.3 Requirements

  • Ideally, make all matches based on estimates of the population in Boulder City who are either a) between 30 and 60% of area median income (AMI) or b) below the poverty line. Option a is preferable; option b is the backup if we encounter data limitations.

  • Proportionate match by race/ethnicity

  • Proportionate match by gender identity

  • Individuals with children under 18 should be represented at twice their estimated share of the population (the population defined in the first bullet on income)

  • Proportionate match by disability status
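As a worked example of the children requirement, the sketch below doubles a raw share and gives the remainder to the other category. The function name `double_share` is hypothetical, and the raw input of 0.200 is the props_raw value this document reports for households with children in its simulated data; real estimates may differ.

```python
def double_share(raw_yes: float) -> dict[str, float]:
    """Double the share of households with children; the rest goes to 'No'."""
    target_yes = 2 * raw_yes
    if not 0 <= target_yes <= 1:
        raise ValueError("doubled share must stay between 0 and 1")
    return {"Yes": target_yes, "No": 1 - target_yes}


# Assumed raw share of 0.200, taken from the simulation parameters.
weights = double_share(0.200)
print({k: round(v, 3) for k, v in weights.items()})
```

With a raw share of 0.200 this gives 0.400 for households with children and 0.600 for households without, which are the child_household values used for weighting.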

0.4 Questionnaire info

The eligibility questionnaire will have questions on each of the above, plus additional eligibility and other characteristics not addressed here.

Ethnicity/race options:

  • Non-Latino White (e.g., German, Irish, English, Italian)

  • Hispanic, Latinx, or Spanish origin (e.g., Mexican/Mexican American, Puerto Rican, Cuban, Dominican, Salvadoran, Colombian)

  • Black or African American (e.g., African American, Jamaican, Haitian, Nigerian, Ethiopian, Somalian)

  • Asian (e.g., Chinese, Filipino, Asian Indian, Vietnamese, Korean, Japanese)

  • American Indian or Alaska Native (e.g., Navajo Nation, Blackfeet Tribe, Muscogee (Creek) Nation, Mayan, Doyon, Native Village of Barrow Inupiat Traditional Government)

  • Native Hawaiian or Other Pacific Islander (e.g., Native Hawaiian, Samoan, Guamanian or Chamorro, Tongan, Fijian, Marshallese)

  • Middle Eastern or North African (e.g., Lebanese, Egyptian)

  • Not Listed (please specify)

Gender:

  • Woman

  • Man

  • Transgender

  • Non-binary/Gender non-conforming

  • Prefer to self identify (please write in your preferred identity here)

Households with children under 18

  • Calculated from the general question on household composition, which asks for each member's relationship and birthday; these are in turn used to determine whether the household has children under 18.

  • Assume this is a binary variable: 1 = household with children under 18, 0 = otherwise.

Disability status:

  • placeholders for now.

0.5 Estimates

This table shows the probabilities that we are working with in the current iteration of our fake data. These are a combination of empirical estimates and rough guesses (for now). The values for child in household have already been modified to double their representation in the sample.

Table 1: Parameters for weighting
sub_group                                    props
race_ethnicity
  White (not latino)                         0.756
  Hispanic                                   0.100
  Black or African American                  0.014
  Asian                                      0.051
  American Indian or Alaska Native           0.002
  Native Hawaiian or Other Pacific Islander  0.001
  Middle Eastern or North African            0.038
  Not Listed                                 0.038
gender
  Woman                                      0.398
  Man                                        0.502
  Transgender                                0.030
  Non-binary/Gender non-conforming           0.030
  Prefer to self identify                    0.040
child_household
  No                                         0.600
  Yes                                        0.400
disability
  None                                       0.850
  Disability1                                0.050
  Disability2                                0.050
  Disability3                                0.050

This table shows the sums across sub-groups as an initial internal check. They should generally sum to 1. The values for child household have already been manipulated to ensure twice as many households with children are included.

Table 2: Proportions for each group (should = 1, a simple comprehension check)
group            group_sum
child_household  1
disability       1
gender           1
race_ethnicity   1
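The sum check can be sketched in a few lines. The dictionary name `PARAMS` is an assumption made for illustration; the values are the weighting parameters from Table 1.

```python
import math

# Weighting parameters copied from Table 1.
PARAMS = {
    "race_ethnicity": {
        "White (not latino)": 0.756, "Hispanic": 0.100,
        "Black or African American": 0.014, "Asian": 0.051,
        "American Indian or Alaska Native": 0.002,
        "Native Hawaiian or Other Pacific Islander": 0.001,
        "Middle Eastern or North African": 0.038, "Not Listed": 0.038,
    },
    "gender": {
        "Woman": 0.398, "Man": 0.502, "Transgender": 0.030,
        "Non-binary/Gender non-conforming": 0.030,
        "Prefer to self identify": 0.040,
    },
    "child_household": {"No": 0.600, "Yes": 0.400},
    "disability": {
        "None": 0.850, "Disability1": 0.050,
        "Disability2": 0.050, "Disability3": 0.050,
    },
}

group_sums = {group: sum(props.values()) for group, props in PARAMS.items()}
for group, total in sorted(group_sums.items()):
    # Allow a small tolerance for floating-point rounding.
    assert math.isclose(total, 1.0, abs_tol=1e-9), f"{group} sums to {total}"
    print(group, round(total, 3))
```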

0.6 Sim data

0.6.1 Population

Fake data for an arbitrary notion of the ‘total population’. This means all the people in Boulder living between 30 and 60% AMI. Right now this is 25000 people.

A few example rows from the simulated population sample:

Table 3: Sample rows from our fake data
id     race_ethnicity      gender                            child_household  disability
18190  White (not latino)  Woman                             No               None
18374  White (not latino)  Woman                             Yes              None
1018   White (not latino)  Woman                             Yes              Disability2
3145   White (not latino)  Woman                             No               None
23489  White (not latino)  Man                               Yes              None
8901   Asian               Non-binary/Gender non-conforming  Yes              Disability3
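A minimal sketch of how such a population could be simulated, assuming each characteristic is drawn independently from the Table 1 proportions (the independence assumption stated in 0.2). The function name `simulate_population` and the seed are hypothetical choices for illustration, not the code used to produce the table above.

```python
import random

# Weighting parameters copied from Table 1.
PARAMS = {
    "race_ethnicity": {
        "White (not latino)": 0.756, "Hispanic": 0.100,
        "Black or African American": 0.014, "Asian": 0.051,
        "American Indian or Alaska Native": 0.002,
        "Native Hawaiian or Other Pacific Islander": 0.001,
        "Middle Eastern or North African": 0.038, "Not Listed": 0.038,
    },
    "gender": {
        "Woman": 0.398, "Man": 0.502, "Transgender": 0.030,
        "Non-binary/Gender non-conforming": 0.030,
        "Prefer to self identify": 0.040,
    },
    "child_household": {"No": 0.600, "Yes": 0.400},
    "disability": {
        "None": 0.850, "Disability1": 0.050,
        "Disability2": 0.050, "Disability3": 0.050,
    },
}


def simulate_population(n, params, seed=1):
    """Draw n people, sampling each characteristic independently."""
    rng = random.Random(seed)
    columns = {
        group: rng.choices(list(props), weights=list(props.values()), k=n)
        for group, props in params.items()
    }
    return [
        {"id": i + 1, **{group: columns[group][i] for group in params}}
        for i in range(n)
    ]


population = simulate_population(25000, PARAMS)
print(population[0])
```

Because the columns are drawn independently, any correlation between characteristics in the real population (for example between disability status and household composition) is not reproduced.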

0.6.2 Enrollees

Randomly select 4000 from the population.

Table 4: Proportions in randomly selected enrollee data
sub_group                                    count  proportions  target_proportions
child_household
  No                                         2433   0.608        0.600
  Yes                                        1567   0.392        0.400
disability
  Disability1                                213    0.053        0.050
  Disability2                                213    0.053        0.050
  Disability3                                196    0.049        0.050
  None                                       3378   0.845        0.850
gender
  Man                                        1995   0.499        0.502
  Non-binary/Gender non-conforming           133    0.033        0.030
  Prefer to self identify                    180    0.045        0.040
  Transgender                                123    0.031        0.030
  Woman                                      1569   0.392        0.398
race_ethnicity
  American Indian or Alaska Native           9      0.002        0.002
  Asian                                      198    0.050        0.051
  Black or African American                  57     0.014        0.014
  Hispanic                                   411    0.103        0.100
  Middle Eastern or North African            137    0.034        0.038
  Native Hawaiian or Other Pacific Islander  5      0.001        0.001
  Not Listed                                 158    0.040        0.038
  White (not latino)                         3025   0.756        0.756

Note: as a reminder/clarifier, in the above table the ‘proportions’ column is what we observe when we select 4000 rows/individuals from our simulated population data. The target_proportions are the values used to simulate the population data. These values will generally be very similar because when you sample a large-ish population at random you will mostly tend to maintain the proportions of its characteristic parts. No weighting is applied at this step because we assume that those who apply to the program are something like a random sample of all those who could apply (the ‘population’).
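The point in the note can be checked with a small sketch. For brevity it simulates only the gender column; the seed and variable names are assumptions made for illustration, and the printed counts will not match Table 4 exactly because the random draws differ.

```python
import random
from collections import Counter

# Gender parameters copied from Table 1.
GENDER = {
    "Woman": 0.398, "Man": 0.502, "Transgender": 0.030,
    "Non-binary/Gender non-conforming": 0.030,
    "Prefer to self identify": 0.040,
}

rng = random.Random(2)

# Simulated population of 25000, then an unweighted draw of 4000 applicants.
population = rng.choices(list(GENDER), weights=list(GENDER.values()), k=25000)
applicants = rng.sample(population, k=4000)  # sampling without replacement

counts = Counter(applicants)
for sub_group, target in GENDER.items():
    observed = counts[sub_group] / len(applicants)
    print(f"{sub_group:34s} {counts[sub_group]:5d} {observed:.3f} {target:.3f}")
```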

0.6.3 Select sample 1

To select the first wave of 200 individuals from our pool of 4000 applicants, we take a weighted sample of the data using the proportions in Table 4.
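This section does not spell out how the per-applicant weights are built, so the following is only one possible reading. It assumes each applicant's weight is the product, across characteristics, of the target proportion divided by the observed applicant proportion, and it draws without replacement using Efraimidis-Spirakis keys. The names `weighted_sample` and `row_weight`, the seed, and the two-characteristic applicant pool are all illustrative assumptions.

```python
import random
from collections import Counter

# Target proportions for two of the characteristics in Table 1.
TARGETS = {
    "child_household": {"No": 0.600, "Yes": 0.400},
    "disability": {"None": 0.850, "Disability1": 0.050,
                   "Disability2": 0.050, "Disability3": 0.050},
}


def weighted_sample(rows, weights, k, rng):
    """Weighted sampling without replacement (Efraimidis-Spirakis keys)."""
    keyed = [(rng.random() ** (1.0 / w), row)
             for row, w in zip(rows, weights) if w > 0]
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in keyed[:k]]


rng = random.Random(3)

# Stand-in applicant pool of 4000, simulated from the targets.
applicants = [
    {"id": i, **{g: rng.choices(list(p), weights=list(p.values()))[0]
                 for g, p in TARGETS.items()}}
    for i in range(4000)
]
observed = {g: Counter(a[g] for a in applicants) for g in TARGETS}


def row_weight(row):
    """Assumed weight: product of target share / observed share per group."""
    weight = 1.0
    for group, props in TARGETS.items():
        observed_share = observed[group][row[group]] / len(applicants)
        weight *= props[row[group]] / observed_share
    return weight


wave1 = weighted_sample(applicants, [row_weight(a) for a in applicants], 200, rng)
print(Counter(r["child_household"] for r in wave1))
```

With only 200 selections, the smallest groups can still land well above or below their targets by chance, so the resulting proportions are worth checking after each draw.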

0.6.4 Select samples 2 and 3

The second wave selection works by taking the wave 1 selection and then using an algorithm to find each individual's closest match among the 3800 individuals remaining in the applicant pool. This is done using a technique called propensity score matching (Ho et al. 2011).

The third wave is selected with the same process.
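The matching step can be illustrated with a self-contained sketch. It is a Python stand-in, not MatchIt code, and the document does not state which matching options were used. The sketch assumes a logistic-regression propensity score (wave membership regressed on one-hot characteristics) and greedy 1:1 nearest-neighbour matching without replacement. All function names, the seed, and the small uniformly drawn demo pool (600 applicants, wave of 30) are hypothetical and chosen only to keep the example quick.

```python
import math
import random

LEVELS = {
    "child_household": ["No", "Yes"],
    "disability": ["None", "Disability1", "Disability2", "Disability3"],
}


def one_hot(row, levels):
    """Encode categorical fields as 0/1 indicators (first level is the baseline)."""
    return [1.0 if row[f] == lv else 0.0
            for f, lvs in levels.items() for lv in lvs[1:]]


def propensity(beta, xi):
    """Logistic model: estimated probability of belonging to the selected wave."""
    z = beta[0] + sum(b * v for b, v in zip(beta[1:], xi))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, steps=200, lr=0.5):
    """Batch gradient descent for logistic regression; returns [bias, *coefs]."""
    beta = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            err = propensity(beta, xi) - yi
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        beta = [b - lr * g / len(X) for b, g in zip(beta, grad)]
    return beta


def match_wave(wave, pool, levels):
    """Greedy 1:1 nearest-neighbour match on the propensity score."""
    rows = wave + pool
    X = [one_hot(r, levels) for r in rows]
    y = [1] * len(wave) + [0] * len(pool)
    beta = fit_logistic(X, y)
    scores = [propensity(beta, xi) for xi in X]
    available = dict(enumerate(scores[len(wave):]))  # pool index -> score
    matched = []
    for score in scores[:len(wave)]:
        best = min(available, key=lambda i: abs(available[i] - score))
        matched.append(pool[best])
        del available[best]  # each pool member can be matched only once
    return matched


rng = random.Random(4)
applicants = [{"id": i, **{f: rng.choice(lvs) for f, lvs in LEVELS.items()}}
              for i in range(600)]
wave1, remaining = applicants[:30], applicants[30:]
wave2 = match_wave(wave1, remaining, LEVELS)
print(wave1[0], wave2[0])
```

Greedy matching depends on the order in which wave 1 members are processed, and ties on the score are broken by pool position, so matched individuals share a similar score but not necessarily identical characteristics.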

0.6.4.1 Match the cases

0.7 Viz the waves

First, let's compare the population data to the applicant data:

group            sub_group                                  target_props  props_raw  ideal_counts  count_applicant  props_applicant  count_w1  props_w1  count_w2  props_w2  count_w3  props_w3
race_ethnicity   Native Hawaiian or Other Pacific Islander  0.001         0.001      1             5                0.001            3         0.015     2         0.010     NA        NA
race_ethnicity   American Indian or Alaska Native           0.002         0.002      1             9                0.002            2         0.010     1         0.005     1         0.005
race_ethnicity   Black or African American                  0.014         0.014      3             57               0.014            13        0.065     12        0.060     11        0.055
gender           Transgender                                0.030         0.030      6             123              0.031            28        0.140     30        0.150     29        0.145
gender           Non-binary/Gender non-conforming           0.030         0.030      6             133              0.033            18        0.090     17        0.085     24        0.120
race_ethnicity   Middle Eastern or North African            0.038         0.038      8             137              0.034            24        0.120     27        0.135     27        0.135
gender           Prefer to self identify                    0.040         0.040      8             180              0.045            6         0.030     5         0.025     4         0.020
race_ethnicity   Not Listed                                 0.038         0.038      8             158              0.040            8         0.040     9         0.045     8         0.040
disability       Disability2                                0.050         0.050      10            213              0.053            15        0.075     14        0.070     7         0.035
race_ethnicity   Asian                                      0.051         0.051      10            198              0.050            10        0.050     9         0.045     11        0.055
disability       Disability3                                0.050         0.050      10            196              0.049            11        0.055     10        0.050     7         0.035
disability       Disability1                                0.050         0.050      10            213              0.053            8         0.040     8         0.040     9         0.045
race_ethnicity   Hispanic                                   0.100         0.100      20            411              0.103            14        0.070     14        0.070     15        0.075
child_household  Yes                                        0.400         0.200      80            1567             0.392            81        0.405     80        0.400     90        0.450
gender           Woman                                      0.398         0.398      80            1569             0.392            62        0.310     64        0.320     60        0.300
gender           Man                                        0.502         0.502      100           1995             0.499            86        0.430     84        0.420     83        0.415
child_household  No                                         0.600         0.800      120           2433             0.608            119       0.595     120       0.600     110       0.550
race_ethnicity   White (not latino)                         0.756         0.756      151           3025             0.756            126       0.630     126       0.630     127       0.635
disability       None                                       0.850         0.850      170           3378             0.845            166       0.830     168       0.840     177       0.885

Figure 1: Proportions by race group in simulated population data.

Figure 2: Proportions by gender in simulated population data.

We can examine just the race and gender breakdowns, above, to see that randomly sampling 4000 individuals from our population of 25000 leads to proportions in each group that are fairly similar.

Next, we can see how the proportions in each sampling wave compare to the ‘ideal’ proportions in the population data:

Figure 3: Proportions by racial grouping, sampling waves.

Figure 4: Proportions by gender, sampling waves.

Figure 5: Proportions of households with a child in the home, by sampling wave.

Figure 6: Proportions by disability status, sampling wave.

References

Ho, Daniel E., Kosuke Imai, Gary King, and Elizabeth A. Stuart. 2011. “MatchIt: Nonparametric Preprocessing for Parametric Causal Inference.” Journal of Statistical Software 42 (8). https://doi.org/10.18637/jss.v042.i08.